Adds lm_eval to evaluations #282
base: main
Conversation
Log-likelihood calculation in the
" passed to the Fast-LLM lm_eval model wrapper.",
)

add_bos_token: bool = Field(
Could this be determined by the model itself?
As I understand from comments across the code, some models simply underperform without it, but by default it is off for all models.
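For context, this is roughly what the flag controls on the tokenization side. A minimal sketch using a Hugging Face tokenizer; the model name and manual BOS handling are illustrative assumptions, not Fast-LLM's actual wrapper code:

```python
from transformers import AutoTokenizer

# Hypothetical illustration only; "gpt2" and the manual BOS prepending
# stand in for whatever Fast-LLM's lm_eval wrapper does internally.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
add_bos_token = True

token_ids = tokenizer.encode("The quick brown fox", add_special_tokens=False)
if add_bos_token and tokenizer.bos_token_id is not None:
    # Prepend the BOS token; some models score noticeably worse without it.
    token_ids = [tokenizer.bos_token_id] + token_ids
```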
        print(f"Determined largest batch size: {self.batch_sizes[sched]}")
        return self.batch_sizes[sched]

    def _loglikelihood_tokens(
What does this do?
It computes per-token log-likelihoods, which are used by lm_eval to calculate various evaluation metrics. The function applies softmax to the returned logits to obtain token-level probabilities. Currently, the logits are moved to the CPU for post-processing, which can be slow but helps avoid GPU memory pressure.
There is a TODO to distribute this function and return only per-token log-probabilities from each shard instead of gathering all logits centrally. However, I prefer to implement that in a separate PR.
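A minimal sketch of the idea, assuming logits of shape (sequence_length, vocab_size) from a single forward pass over context plus continuation; the function name and shapes are illustrative, not Fast-LLM's actual implementation:

```python
import torch

def per_token_logprobs(logits: torch.Tensor, continuation_ids: torch.Tensor) -> torch.Tensor:
    # Move logits to CPU first: slower, but avoids GPU memory pressure.
    logits = logits.float().cpu()
    # log_softmax turns logits into log-probabilities over the vocabulary.
    log_probs = torch.log_softmax(logits, dim=-1)
    # Logits at position t predict token t + 1, so the last len(continuation_ids)
    # predictions (excluding the final one) score the continuation tokens.
    preds = log_probs[-len(continuation_ids) - 1 : -1]
    # Pick out the log-probability assigned to each actual continuation token.
    return preds.gather(-1, continuation_ids.cpu().unsqueeze(-1)).squeeze(-1)
```

lm_eval then sums these per-token values to get the continuation's log-likelihood and compares candidate answers from there.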
# Note: metrics modified in-place
if self._wandb is not None:
    import wandb

-       wandb.log(metrics, step=completed_steps)  # noqa
+       wandb.log(metrics, step=completed_steps, commit=commit)  # noqa
Why do we want this?
This allows the code to log multiple times for the same step. When the final log for a step is recorded (with commit=True), all earlier logs for that step become immediately visible in Weights & Biases (wandb). Previously, a step's logs stayed hidden until logging began for the next step.
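A minimal sketch of the commit semantics; the project name and metric keys are made up for illustration:

```python
import wandb

# Offline mode just keeps this sketch runnable without a W&B account.
run = wandb.init(project="demo-commit-semantics", mode="offline")

# Intermediate logs for the same step are buffered with commit=False.
wandb.log({"eval/wikitext_ppl": 12.3}, step=100, commit=False)
wandb.log({"eval/accuracy": 0.42}, step=100, commit=False)
# The final log for the step commits everything logged for step 100,
# so it appears in the UI right away instead of waiting for step 101.
wandb.log({"train/loss": 1.7}, step=100, commit=True)

run.finish()
```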
…d only internally, made max_length settable
It's getting close, but the lack of tests will be a problem. Do you intend to add some?
The main code looks ready, but I have some concerns with the tests. If we want to merge now, I suggest moving the test changes to another PR; then I can approve right away.
tests/models/test_lm_eval_simple.py
Outdated
]


@pytest.mark.extra_slow
How long does this take? It would be worrying not to have any tests other than extra-slow.
Very long: 40-80 seconds per test.
@pytest.fixture(scope="function")
def get_lm_eval_config(tokenizer_path, monkeypatch):
I managed to reduce test time by trimming bloat in lm_eval, reducing the evaluation size, and restricting the task to wikitext (the others are much slower). That should still be enough, since we're only testing Fast-LLM's wrapper, not lm_eval itself. It's now below 10 seconds, so we no longer need to mark it as extra_slow.
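For reference, the knobs involved look roughly like this through lm_eval's Python API; the tiny model, task, and limit here are illustrative stand-ins, since the actual test goes through Fast-LLM's wrapper and fixtures:

```python
import lm_eval

# A tiny model plus a strict sample limit keeps the run to a few seconds;
# wikitext alone avoids the much slower tasks.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=sshleifer/tiny-gpt2",
    tasks=["wikitext"],
    limit=8,  # evaluate only a handful of documents
)
print(results["results"]["wikitext"])
```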
cool, thanks
tests/models/test_lm_eval.py
Outdated
run_test_script_for_all_models(
    get_lm_eval_config(run_test_script_base_path / "test_lm_eval_in_training"),
    compare="test_lm_eval_in_training",
    prepare_fn=_prepare_resume_fn,
This doesn't make sense: we're starting an evaluation, not resuming training, so we should use the checkpoint as a pretrained model.
Okay, I created a function that simply copies the training run with a checkpoint, but the evaluation still starts from the training config. I haven't tested starting from a pretrained checkpoint yet; maybe we should leave that for another PR?
cartesia_pytorch>=0.0.2

GENERATION =
    lm_eval>=0.4.9
Added the dependency; packages we use should be listed in setup.
Maybe the section should be named 'evaluations', since it's not specifically for generation?
@@ -392,3 +394,23 @@ def enabled(self) -> bool:
    @property
    def interrupted(self):
        return self._interrupted


def set_global_variables(disable_torch_dynamo: bool = False) -> None:
Made this into a separate method so we can call it in more places, e.g. conftest and CLI code paths that don't go through Run.
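A minimal sketch of the shape of such a helper; the environment-variable approach is an assumption for illustration, not the actual Fast-LLM implementation:

```python
import os

def set_global_variables(disable_torch_dynamo: bool = False) -> None:
    # Process-wide settings that must be applied before heavy imports,
    # so the helper can be called from conftest or CLI entry points
    # that never construct a Run.
    if disable_torch_dynamo:
        # TORCHDYNAMO_DISABLE is read by torch._dynamo when it initializes.
        os.environ["TORCHDYNAMO_DISABLE"] = "1"
```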
Should be ready to merge once the remaining comment is addressed and tests pass.
Description
Add lm_eval integration which enables running evaluation directly from an in-memory model. Currently, only data parallelism is supported.
Closes #199
Type of change
Select all that apply:
Changes
List the key changes introduced in this PR:
Improved Weights & Biases (wandb) Logging
The final log for each step now explicitly commits the data, ensuring that the step is immediately visible in the wandb interface. Previously, a step would only appear after logging began for the following step.
Integration of `lm_eval` Evaluator
Added support for `lm_eval` as a built-in evaluation method. This enables running standard language model benchmarks during or after training using the Evaluation Harness.
Extended Support for Distributed Primitives
Introduced custom distributed utilities adapted from PyTorch to support the non-standard process groups needed for the lm_eval integration.
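As a rough illustration of the kind of primitive involved, here is plain torch.distributed usage over a non-default process group (a sketch assuming the process group is already initialized, not the adapted Fast-LLM utilities):

```python
import torch.distributed as dist

def gather_on_subgroup(obj, ranks: list[int]):
    # new_group must be called collectively by every process in the world,
    # including ranks that are not part of the subgroup.
    group = dist.new_group(ranks=ranks)
    if dist.get_rank() not in ranks:
        return None
    gathered = [None] * len(ranks)
    # Collect one Python object from each rank of the non-default group.
    dist.all_gather_object(gathered, obj, group=group)
    return gathered
```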
Added User Guide for Evaluations
Included documentation explaining how to configure and use evaluators such as `loss` and `lm_eval`, both during training and in standalone evaluation mode.
Checklist
Make sure the following tasks are completed before submitting the PR:
General
Testing